Batch Document Filtering Using Nearest Neighbor Algorithm

Authors

  • Ali Mustafa Qamar
  • Éric Gaussier
  • Nathalie Denos

Abstract

This paper describes the participation of the LIG lab in the batch filtering task of the INFILE (INformation FILtering Evaluation) campaign of CLEF 2009. In the online task, the server provides the documents one by one; in the batch task, all of the documents are provided beforehand, which is why feedback is not possible. We propose a batch algorithm to learn category-specific thresholds in a multiclass environment where a document can belong to more than one class. The algorithm uses the k-nearest neighbor algorithm to filter the 100,000 documents into 50 topics. The experiments were run on the English corpus and gave a precision of 0.256 and a recall of 0.295. We had participated in the online task in INFILE 2008, where we used an online algorithm that relied on feedback from the server. Compared with INFILE 2008, recall is significantly better in 2009 (0.295 vs. 0.260), whereas precision was higher in 2008 (0.306). Furthermore, the anticipation in 2009 was 0.43, compared with 0.307 in 2008.
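The general idea of nearest-neighbor filtering with category-specific thresholds can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the scoring rule, or how the thresholds are learned, so everything below is an assumption made for illustration: cosine similarity over sparse term-weight dictionaries, a topic score equal to the summed similarity of the k nearest labeled documents carrying that topic, and thresholds supplied by hand rather than learned. The names `cosine` and `knn_filter` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts (assumed measure)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_filter(doc, labeled, thresholds, k=3):
    """Return the set of topics assigned to `doc`.

    labeled    -- list of (term-weight dict, set of topics) pairs
    thresholds -- dict mapping topic -> minimum score (given here, not learned)
    A document may receive several topics, or none at all.
    """
    # Rank labeled documents by similarity and keep the k closest.
    ranked = sorted(
        ((cosine(doc, vec), topics) for vec, topics in labeled),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]

    # Assumed scoring rule: sum the similarities of neighbors per topic.
    scores = defaultdict(float)
    for sim, topics in ranked:
        for topic in topics:
            scores[topic] += sim

    # A topic is assigned only if its score reaches that topic's own threshold;
    # topics without a threshold are never assigned.
    return {t for t, s in scores.items() if s >= thresholds.get(t, float("inf"))}


if __name__ == "__main__":
    labeled = [
        ({"drought": 1.0, "rain": 1.0}, {"climate"}),
        ({"spam": 1.0, "mail": 1.0}, {"spam"}),
        ({"rain": 1.0, "flood": 1.0}, {"climate", "disaster"}),
    ]
    thresholds = {"climate": 0.5, "disaster": 0.6, "spam": 0.5}
    print(knn_filter({"rain": 1.0, "drought": 1.0}, labeled, thresholds, k=2))
```

Because each topic has its own threshold, the same neighbor evidence can be enough for one topic and insufficient for another, which is what allows a document to be filtered into several topics or into none.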

Similar resources

Working Notes for the InFile Campaign : Online Document Filtering Using 1 Nearest Neighbor

This paper has been written as a part of the InFile (INFormation, FILtering, Evaluation) campaign. This project is a crosslanguage adaptive filtering evaluation campaign, sponsored by the French national research agency, and it is a pilot track of the CLEF (Cross Language Evaluation Forum) 2008 campaigns. We propose in this paper an online algorithm to learn category specific thresholds in a mu...

An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification

The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...

Instance-Based Spam Filtering Using SVM Nearest Neighbor Classifier

In this paper we evaluate an instance-based spam filter based on the SVM nearest neighbor (SVM-NN) classifier, which combines the ideas of SVM and k-nearest neighbor. To label a message the classifier first finds k nearest labeled messages, and then an SVM model is trained on these k samples and used to label the unknown sample. Here we present preliminary results of the comparison of SVM-NN wi...

Drought Monitoring and Prediction using K-Nearest Neighbor Algorithm

Drought is a climate phenomenon which might occur in any climate condition and all regions on the earth. Effective drought management depends on the application of appropriate drought indices. Drought indices are variables which are used to detect and characterize drought conditions. In this study, it was tried to predict drought occurrence, based on the standard precipitation index (SPI), usin...

Publication date: 2009